ABSTRACT

In this work, we developed UUCD (http://uucd.biocuckoo.org),
a family-based database for ubiquitin and ubiquitin-like
conjugation, which is one of the most important
post-translational modifications and regulates a variety of
cellular processes through a similar E1 (ubiquitin-activating
enzyme)-E2 (ubiquitin-conjugating enzyme)-E3 (ubiquitin-protein
ligase) enzyme thioester cascade. Although extensive
experimental efforts have been made, an integrative data
resource is still not available. From the scientific
literature, 26 E1s, 105 E2s, 1003 E3s and 148 deubiquitination
enzymes (DUBs) were collected and classified into 1, 3, 19 and
7 families, respectively. To computationally characterize
potential enzymes in eukaryotes, we constructed 1, 1, 15 and 6
hidden Markov model (HMM) profiles for E1s, E2s, E3s and DUBs
at the family level, respectively. Moreover, ortholog searches
were conducted for E3 and DUB families without HMM profiles.
The UUCD database was then developed with 738 E1s, 2937 E2s,
46631 E3s and 6647 DUBs of 70 eukaryotic species. Detailed
annotations and classifications are also provided. The online
service of UUCD was implemented in PHP + MySQL + JavaScript +
Perl.

INTRODUCTION 

The 2004 Nobel Prize in Chemistry was awarded to Aaron 
Ciechanover, Avram Hershko and Irwin Rose for their 
seminal discovery of the ubiquitin conjugation that 
targets proteins for degradation in an ATP-dependent 
manner (1-4). Previously, ubiquitin was isolated as a
heat-stable protein with 76 aa (3), while further analyses
revealed that the 'ubiquitin kiss' functions as a molecular 
death-tag through the ubiquitination, which reversibly 
and covalently forms an isopeptide bond between the 
C-terminal (-GG) carboxyl group of a ubiquitin protein 
and the ε-amino group of lysine residues or, less
commonly, other types of residues of a substrate protein 
(4,5). The protein substrates can be modified by 
conjugating with mono- or polyubiquitins (4). Beyond 
degradation, non-proteolytic functions of ubiquitin conju- 
gation were also demonstrated (6,7). For example, the 
K63-linked ubiquitin chain is implicated in DNA repair 
and endocytosis (7), whereas monoubiquitination plays an 
important role in histone regulation, virus budding and 
DNA repair (6,7). 

The enzymatic process of the ubiquitin conjugation is a 
sequential three-step cascade (4). First, the C-terminal 
glycine (G) of ubiquitin is activated by a
ubiquitin-activating enzyme (E1) to form an E1-Ub thioester in
an ATP-dependent manner. Then the activated ubiquitin 
is transferred to the active site cysteine (C) residue of 
a ubiquitin-conjugating enzyme (E2). Finally, a 
ubiquitin-protein ligase (E3) removes the ubiquitin from 
the E2 and forms a covalent link between ubiquitin and a 
protein substrate (4). Also, ubiquitination is reversible by 
numerous deubiquitination enzymes (DUBs) (8). 
Analogous to ubiquitin, other members of the superfamily
of ubiquitin-like modifiers, such as SUMO, ISG15, Rub1/Nedd8
and Atg8/12, also adopt similar catalytic procedures (9).
Ubiquitin and ubiquitin-like conjugation
regulates a large number of cellular processes, such as
cell cycle, signal transduction, apoptosis and autophagy 
(4,9), while aberrant modification is implicated in 
numerous pathologies, such as neurodegenerative dis- 
orders, inflammatory diseases and cancers (7,10). In this 
regard, identification of E1s, E2s, E3s and DUBs is
fundamental for understanding the regulatory roles of
ubiquitin and ubiquitin-like conjugation and provides 
candidate drug targets for further biomedical consider- 
ation (11). 

After the first E1 ligase was isolated in 1981 (12), the
identification and classification of E1s, E2s, E3s and DUBs
emerged as a great challenge. In 2003, with the completion
of the RIKEN FANTOM-2 project, the numbers of
ubiquitin-associated enzymes were reported as 835, 764,
320, 785, 162 and 145 for Homo sapiens, Mus musculus, 
Drosophila melanogaster, Caenorhabditis elegans, 
Schizosaccharomyces pombe and Saccharomyces cerevisiae, 
respectively (13). In particular, there were 16 E1s, 53 E2s,
527 E3s and 74 DUBs characterized in H. sapiens and 
roughly classified by InterPro domains (13). Later, the 
number of human putative DUBs was refined as 95 and 
further classification was based on phylogenetic relations 
(8). In 2003, the first public database, PlantsUBQ, was
constructed with E1s, E2s and E3s in Arabidopsis (14).
Later, the SCUD database contained 11 E2s, 42 E3s, 20
DUBs and 940 substrates in S. cerevisiae (15). In 2009,
plantsUPS was developed with E1s, E2s and E3s in seven
plants (16). Recently, the hUbiquitome database collected
1 E1, 12 E2s, 138 E3s, 17 DUBs and 279 substrates of
H. sapiens from the scientific literature (17). Although a 
number of studies were performed, an integrative data 
resource is still not available. 

In this work, we manually collected 26 E1s, 105 E2s,
1003 E3s and 148 DUBs for ubiquitin and ubiquitin-like
conjugation systems from the scientific literature
(Table 1). Based on previously established rationales
(8,9,18-23), we classified E1s, E2s and DUBs into one,
three and seven families, respectively. We also classified
E3s into a hierarchical structure with five levels: class,
group, subgroup, family and single E3. In total, we
obtained 2 classes, 7 groups, 4 subgroups and 19 families
for E3s. With HMMER (24), we constructed 1, 1, 15 and 6
hidden Markov model (HMM) profiles for E1s, E2s, E3s and
DUBs at the family level, respectively. We then used the
HMM profiles to computationally characterize 738 E1s,
2797 E2s, 43 881 E3s and 6516 DUBs from 70 eukaryotic
organisms. For families without HMM profiles, we
additionally conducted an ortholog search, which detected
140 E2s, 2750 E3s and 131 DUBs. The classification
information was provided, and the detailed annotations
from the Ensembl (25) and UniProt (26) databases were also
integrated. Finally, a comprehensive database of ubiquitin
and ubiquitin-like conjugation (UUCD) was developed with
56 949 enzymes for ubiquitin and ubiquitin-like conjugation
across 70 eukaryotic species.

CONSTRUCTION AND CONTENT 

Data collection and curation 

Although there are up to 17 types of ubiquitin-like 
proteins that can covalently modify other molecules, 
only nine of them specifically recognize protein substrates 
(9). Thus, we searched PubMed with multiple
keywords, including 'ubiquitin', 'ubiquitination',
'ubiquitylation', 'SUMO', 'sumoylation', 'NEDD8',
'neddylation', 'Atg8', 'Apg8', 'Atg12', 'Apg12', 'Urm1',
'ISG15', 'isgylation', 'UFM1' and 'FAT10'.
In total, we collected 26 E1s, 105 E2s, 1003 E3s and 148
DUBs for ubiquitin and ubiquitin-like conjugation in
eukaryotes (Table 1 and Supplementary Table S1). From
Ensembl (25) and UniProt (26) databases, the full-length 
sequences of the enzymes were obtained. The functional 
domain information was taken from the annotations in 
the UniProt database and further verified by searching 
the Pfam database (27). Moreover, the complete 
proteome sequences were downloaded for 70 eukaryotes, 
including 54 animals, 14 plants and 2 fungi, from Ensembl 
(release version 64, http://www.ensembl.org/), 
EnsemblPlants (release version 11, http://plants.ensembl.org/)
and EnsemblFungi (release version 11, http://fungi.ensembl.org/),
respectively (25).

The Classification 

As previously described, the known E1s can be distinguished
from other proteins by the occurrence of ThiF/MoeB domains
(9). Thus, all E1s were classified into a single family.
The E2s were categorized into three families based on their
functional domains: UBC (ubiquitin-conjugating), UEV
(ubiquitin-enzyme variant) and Other (unclassified) (18).
DUBs for ubiquitin were classified into five families: UCH
(ubiquitin C-terminal hydrolase), USP (ubiquitin-specific
protease), OTU (ovarian tumor), Josephin (Machado-Joseph
disease, MJD) and JAMM (JAB1/MPN/Mov34 metalloenzyme) (8).
In addition, the DUBs for ubiquitin-like conjugations, such
as SENPs for SUMO, were classified into the ULP
(ubiquitin-like protease) family. The Other family contained
unclassified DUBs (8).

Table 1. The data statistics of known E1, E2, E3 and DUB proteins

Organism                     E1    E2     E3   DUB   Total
Homo sapiens                  8    39    475    91     613
Mus musculus                  2     6     71    14      93
Drosophila melanogaster       0     5     46     4      55
Caenorhabditis elegans        5     6     57     2      70
Saccharomyces cerevisiae      3    16     78    19     116
Schizosaccharomyces pombe     2     4     39     6      51
Arabidopsis thaliana          5    27    182    10     224
Others                        1     2     55     2      60
Total                        26   105   1003   148    1282

From the scientific literature, we manually collected experimentally identified E1s, E2s, E3s and DUBs.

The classification of E3 ligases is complicated, because a 
considerable number of proteins participate in forming E3 
ligase complexes as adaptors without E3 activities (19-23). 
In this regard, we first classified E3-associated proteins 
into two classes as E3 activity and E3 adaptor (19-22). 
Enzymes in the E3 activity class were further categorized 
into four groups and seven families, including HECT 
(with one family of HECT), RING (two families as 
RING and U-box), N-recognin (three families as 
UBR-box, N-domain and Other) and Other (with one 
family of Other). Proteins in the E3 adaptor class were 
classified into 2 groups and 12 families, including Cullin 
RING (10 families as Cullin, F-box, SKP1, SOCS/VHL/ 
BC-box, Elon-B/C, BTB_3-box, BTB_Other, DWD, 
DDB1 and Other) and APC/C (two families as CDC20 
and APC/C). As different Cullins determine the specific 
compositions of distinct Cullin RING E3 ligase complexes 
(21,23), we further classified adaptors in the Cullin RING 
group into five subgroups, including SCF (Cullin 1 and 7), 
ECS (Cullin 2 and 5), BCR (Cullin 3), DCX (Cullin 4A/B) 
and Other (with one family of Other). 
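The class, group and family levels described above can be written down as a nested mapping, which makes the family counts easy to check. The Python sketch below is only an illustration: the names `E3_HIERARCHY` and `count_families` are hypothetical and are not part of UUCD, and the Cullin RING subgroups are left out for brevity.

```python
# Hypothetical in-memory representation of the E3 classification
# (class -> group -> families), transcribed from the text above.
E3_HIERARCHY = {
    "E3 activity": {
        "HECT": ["HECT"],
        "RING": ["RING", "U-box"],
        "N-recognin": ["UBR-box", "N-domain", "Other"],
        "Other": ["Other"],
    },
    "E3 adaptor": {
        "Cullin RING": [
            "Cullin", "F-box", "SKP1", "SOCS/VHL/BC-box", "Elon-B/C",
            "BTB_3-box", "BTB_Other", "DWD", "DDB1", "Other",
        ],
        "APC/C": ["CDC20", "APC/C"],
    },
}


def count_families(hierarchy):
    """Return {class_name: number_of_families} for a class -> group -> families mapping."""
    return {
        cls: sum(len(families) for families in groups.values())
        for cls, groups in hierarchy.items()
    }


counts = count_families(E3_HIERARCHY)
total_families = sum(counts.values())
```

With the families listed in the text, this gives 7 families in the E3 activity class and 12 in the E3 adaptor class, 19 in total.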

Proteome-wide identification 

As the number of proteins is limited for several families,
we constructed a total of 1, 1, 15 and 6 HMM profiles for E1,
E2, E3 and DUB families, respectively. The functional
domain sequences for each family were first aligned by
MUSCLE (http://www.drive5.com/muscle/, version 
3.8.31), a widely used program for multiple sequence 
alignment (28). Then we used the hmmbuild program in 
the HMMER 3.0 package (http://hmmer.janelia.org/) (24) 
to construct HMM models. Moreover, the hmmsearch 
program of HMMER 3.0 (24) was used to search all 
protein sequences in 70 eukaryotic species. The default 
parameters were chosen for the three tools, and additional 
filters were used to improve the accuracy (presented in the 
'Discussion' section). Since one gene can generate multiple 
variant transcripts, the Ensembl Gene ID was adopted as 
the unique accession to avoid any redundancy. For each
gene, only the protein with the lowest E-value was
retained. All HMM profiles are available at
http://uucd.biocuckoo.org/faq.php.
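The redundancy filter described above (one protein per Ensembl Gene ID, chosen by the lowest E-value) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the tuple layout `(gene_id, protein_id, e_value)`, the function name `best_hit_per_gene` and the example identifiers are all assumptions, and parsing of the hmmsearch output is left out.

```python
from typing import Dict, Iterable, Tuple

# (gene_id, protein_id, e_value); this layout is an assumption for illustration.
Hit = Tuple[str, str, float]


def best_hit_per_gene(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep, for each gene, the protein isoform with the lowest E-value."""
    best: Dict[str, Hit] = {}
    for gene_id, protein_id, e_value in hits:
        current = best.get(gene_id)
        if current is None or e_value < current[2]:
            best[gene_id] = (gene_id, protein_id, e_value)
    return best


# Made-up identifiers and E-values, for illustration only.
example_hits = [
    ("GENE_A", "PROT_A1", 1e-30),
    ("GENE_A", "PROT_A2", 1e-45),
    ("GENE_B", "PROT_B1", 1e-12),
]
selected = best_hit_per_gene(example_hits)
```

In this toy input, `GENE_A` has two isoforms and only the one with the smaller E-value is retained. Ties keep the first hit seen, which is an arbitrary choice of this sketch.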

Here, we took the 1282 identified E1s, E2s, E3s and DUBs
as the benchmark dataset to evaluate the prediction
performance and robustness of the HMM identifications. For
each family, the annotated sequences were taken as
positive data (P), while all other proteins were regarded
as negative data (N). The sensitivity (Sn) and specificity
(Sp) were calculated as below:

$$S_n = \frac{TP}{TP + FN} \quad \text{and} \quad S_p = \frac{TN}{TN + FP}$$

where TP, FN, TN and FP denote the numbers of true positives,
false negatives, true negatives and false positives, respectively.



Both self-consistency and leave-one-out validations
were carried out. The receiver operating characteristic
(ROC) curves were plotted, and AROC (area under
ROC) values were calculated for eight families
(Supplementary Figure S1). The results suggested
that our predictions are accurate and robust
(Supplementary Figure S1). To ensure that all curated
proteins can be correctly predicted and classified
(Sn = 100%), we selected distinct cut-off values for
different families (Supplementary Table S2).
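The sensitivity and specificity measures, and the idea of choosing a per-family cut-off that keeps Sn at 100%, can be illustrated with a short Python sketch. It assumes that each protein has a single score where higher means a better match (for example a bit score); the text does not state whether the actual cut-offs were set on scores or E-values, so this is an assumption, as are the function names and the toy numbers.

```python
def sn_sp(scores, labels, cutoff):
    """Sensitivity and specificity when score >= cutoff is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def cutoff_for_full_sensitivity(scores, labels):
    """Highest cutoff that still recovers every positive (Sn = 100%)."""
    return min(s for s, y in zip(scores, labels) if y)


# Toy scores (made up): True marks a curated family member.
scores = [120.0, 95.5, 60.2, 75.0, 20.1, 5.3]
labels = [True, True, True, False, False, False]

cutoff = cutoff_for_full_sensitivity(scores, labels)
sn, sp = sn_sp(scores, labels, cutoff)
```

Here the weakest curated member sets the cut-off, so every positive is recovered, at the cost of admitting the one negative that scores above it. The sketch assumes at least one positive and one negative are present.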

For the families without HMM profiles, we conducted
an orthology search across the 70 eukaryotes using the
reciprocal best-hit approach (29), which calls two proteins
from two different proteomes an ortholog pair if each finds
the other as its best hit. The searches were run with
the blastall program in the BLAST package (30).
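The reciprocal best-hit criterion can be sketched as follows. The example assumes that the best hit of each protein has already been extracted from the BLAST output into two dictionaries, one per search direction; how "best" is ranked (E-value or bit score) is not specified in the text, and the function name and identifiers are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )


# Made-up best-hit tables between two proteomes A and B.
a_to_b = {"A1": "B1", "A2": "B3", "A3": "B2"}
b_to_a = {"B1": "A1", "B2": "A3", "B3": "A9"}

pairs = reciprocal_best_hits(a_to_b, b_to_a)
```

In this toy input, `A2` hits `B3`, but `B3` prefers another protein, so that pair is not reported.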

In total, we computationally characterized 738 E1s,
2937 E2s, 46 631 E3s and 6647 DUBs from 70 eukaryotic
species. A heat map of the classifications and
results for several major groups or families was drawn
with the ggplot2 package (http://had.co.nz/ggplot2/) in
R (http://www.r-project.org/) (Figure 1). The
detailed classifications and data statistics are available in
Supplementary Table S3.



USAGE 

The UUCD database was constructed in an easy-to-use
manner. Here, we use human F-box/WD
repeat-containing protein 1A (β-TrCP) as an example to
describe the usage of UUCD. The Browse page is the
main entry point for looking through UUCD.
Two strategies were implemented for browsing
the data: by species and by classifications
(Figure 2). In the option of 'Browse by species', the
right tree represents the phylogenetic relations of
eukaryotic species in Ensembl, while the left tree represents
the Ensembl taxonomy categories, including primates,
rodents, laurasiatheria, afrotheria and so on (25)
(Figure 2A). Clicking on the 'Homo sapiens' button
shows the E1, E2, E3 and DUB families in H. sapiens
(Figure 2A). UUCD can also be browsed by
classifications (Figure 2B). The left tree represents the
hierarchical categories, whereas known 3D structures of E1s,
E2s, E3s or DUBs were taken from the PDB database (31)
and are presented on the right (Figure 2B). Since human β-TrCP
belongs to the F-box family, users can click on the
'F-box' button to view the family information for the 70
eukaryotic species (Figure 2B). By clicking either on the
'F-box' button in the page of human families (Figure 2A)
or on the 'Homo sapiens' button in the page of the F-box
family (Figure 2B), the members of the human F-box family
are listed, together with a brief description of the
biological functions and regulatory roles of
F-box-containing proteins (Figure 2C). The UUCD ID (UUC-)
was adopted for organizing the database, while the
Ensembl Gene ID was used as the secondary accession
(Figure 2C). Users can click on
'UUC-HoS-00457' to view the detailed information of
human β-TrCP (Figure 2D).

Figure 1. The heat map of the classifications and protein numbers of E1s, E2s, DUBs and several major groups for E3 ligases.

The UUCD can be searched with one or multiple
keywords (Figure 3A). For example, if the keyword
'TRCP' is entered and submitted, the results are
shown in a tabular page, with the UUCD ID
and protein/gene names/aliases (Figure 3A). Furthermore,
three additional options were provided:
(i) advance search, (ii) BLAST search and (iii) HMM
search. In the advance search option, up to three search
terms can be specified and submitted for a more
precise search (Figure 3B). The BLAST search option
was designed for querying UUCD by protein sequence.
The blastall program of the NCBI BLAST package (30) was
integrated in the database. Users can input a protein sequence in
FASTA format to search for identical or homologous
proteins (Figure 3C). In the HMM search option, a
protein sequence in FASTA format can be entered and
scanned with the 1, 1, 15 and 6 HMM profiles of the E1, E2, E3
and DUB families, respectively. If the protein is
identified as a ubiquitin-associated enzyme, the
classification and detection information will be presented
(Figure 3D).

DISCUSSION 

As one of the most important post-translational
modifications of proteins, reversible ubiquitin and ubiquitin-like
conjugation has been implicated in almost all aspects of
biological processes and functions, and determines
cellular dynamics and plasticity (1,4,5,7-9). Aberrations of
ubiquitin and ubiquitin-like conjugation systems are highly
implicated in a variety of diseases and cancers (7,10). In this
regard, identification and classification of E1s, E2s, E3s and
DUBs is fundamental for dissecting the molecular
mechanisms and regulatory roles of the conjugation
(4,8,9), analyzing the phylogenetic relations of enzymes
(8,23), modeling E3-substrate networks (32) and providing
potent candidates for drug design (11). Developing a
comprehensive database with detailed annotation and
classification information has become an urgent challenge.

Previously, a number of computational efforts were made
to systematically characterize proteins in the
ubiquitin-proteasome system (UPS) (8,13-17). For example,
Semple et al. (13) performed a genome-wide study by
identifying potential E1s, E2s, E3s and DUBs from four animals
and two fungi. From the scientific literature, the SCUD (15) and
hUbiquitome (17) databases collected experimentally verified
enzymes for S. cerevisiae and H. sapiens, respectively. The
proteins in the two databases were fully covered by our
benchmark dataset (Supplementary Table S1). To the best
of our knowledge, the most comprehensive previous
database was plantsUPS, which contains predicted E1s,
E2s and E3s for up to seven plants but no experimental
information (16).

By collecting known enzymes of ubiquitin and ubiquitin-like
conjugation systems, we classified them based on
distinguishable functional domains (Supplementary Table S3).
Although several general databases such as Pfam (27),
InterPro (33) and SMART (34) contain most of these
domains, we re-constructed HMM profiles with our
benchmark dataset to ensure the prediction performance, and
additional filters were used for accurate classification
(Supplementary Figure S1). For example, the HMM
profiles of UBC and UEV domains are highly similar and
cannot be distinguished by hmmsearch (24). However, the
active site cysteine residue for ubiquitin coupling, located
at ~70-90 aa, does not exist in UEV domains (18)
(Supplementary Figure S2). This single rule was adopted
for accurately separating UBC and UEV domains
(Supplementary Figure S1). Also, the SOCS/VHL/BC-box
family is less conserved and cannot be defined
with a single domain (35). Thus, proteins containing at
least one of the SOCS, VHL or BC-box domains were classified
into this family. Furthermore, recent analysis revealed that a
32 aa '3-box' motif following the BTB domain is essential for
Cullin binding, although BTB proteins without the motif
might interact with Cul3 through alternative mechanisms (36).
The proteins predicted with both a BTB domain and a 3-box
motif were classified into the BTB_3-box family
(Supplementary Figure S3). In addition, experimental
studies suggested that the highly conserved motif
EXnHXHX10D (where Xn and X10 denote n and 10 arbitrary
residues) is essential for the DUB activity of the
JAMM family (37,38). The predicted JAMM proteins
without this motif were discarded.
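The additional filters described above are simple rules and can be sketched in Python. This is a hedged illustration under several assumptions that are not stated in the text: positions 70-90 are counted from the start of the sequence passed in, n in the JAMM motif is read as "one or more residues", BTB proteins lacking the 3-box go to BTB_Other, and all function names and sequences are hypothetical.

```python
import re

# Regex reading of the EXnHXHX10D motif. Treating n as "one or more
# residues" is an assumption, since the text does not bound n.
JAMM_MOTIF = re.compile(r"E.+?H.H.{10}D")


def classify_e2(domain_seq, start=70, end=90):
    """UBC if a cysteine occurs at 1-based positions start..end, otherwise UEV."""
    return "UBC" if "C" in domain_seq[start - 1:end] else "UEV"


def is_socs_vhl_bc_box(domains):
    """Family membership needs at least one of the three domains."""
    return any(d in domains for d in ("SOCS", "VHL", "BC-box"))


def classify_btb(has_btb_domain, has_3box_motif):
    """BTB_3-box needs both features; other BTB proteins go to BTB_Other (assumed)."""
    if not has_btb_domain:
        return None
    return "BTB_3-box" if has_3box_motif else "BTB_Other"


def keep_jamm(seq):
    """Predicted JAMM proteins are kept only if the motif is present."""
    return JAMM_MOTIF.search(seq) is not None


# Synthetic sequences, not real proteins.
ubc_like = "A" * 79 + "C" + "A" * 60   # cysteine at position 80
uev_like = "A" * 140                   # no cysteine at all
jamm_like = "MKE" + "AAAA" + "HSH" + "A" * 10 + "D" + "KK"
jamm_broken = "MKEAAAAHSHAAAD"         # spacer after HXH is too short
```

Because n is unbounded in this reading, the regular expression is permissive; a production filter would probably restrict the search to the predicted domain region.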

Together with the ortholog search for families without
HMM profiles, we systematically identified 738 E1s, 2937
E2s, 46 631 E3s and 6647 DUBs from 70 eukaryotic
organisms, with an average of 813.6 enzymes
per organism (Supplementary Table S3). Whereas an animal
encodes 672.5 enzymes on average, the average
number in plants (1441.5) is more than 2-fold higher
(Supplementary Table S3). Also, the numbers of
animal or plant enzymes in the same group or family can
differ greatly (Figure 1). For example, we
identified 495 E1s in 54 animals, with an average
of 9.2 per species, whereas up to 226 E1s were detected in
14 plants, with an average of 16.1 (Supplementary
Table S3). In contrast, there were 1803 SOCS/VHL/BC-box
proteins identified in animals (33.4 per organism), while
only 19 SOCS/VHL/BC-box proteins were detected in
plants (Supplementary Table S3). Taken together, our
systematic analysis demonstrated the complexity and
diversity of enzymes for ubiquitin and ubiquitin-like
conjugation in eukaryotes.
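The per-species averages quoted above follow directly from the counts and can be reproduced with a few lines of Python. The variable names are hypothetical; the numbers are those given in the text.

```python
# Counts quoted in the paragraph above.
animals, plants = 54, 14
e1_animals, e1_plants = 495, 226
socs_animals = 1803

avg_e1_animals = round(e1_animals / animals, 1)      # E1s per animal species
avg_e1_plants = round(e1_plants / plants, 1)         # E1s per plant species
avg_socs_animals = round(socs_animals / animals, 1)  # SOCS/VHL/BC-box per animal

# Ratio of the average enzyme number per plant to that per animal.
plant_to_animal_ratio = 1441.5 / 672.5
```

The ratio is slightly above 2, consistent with the statement that plants encode more than twice as many of these enzymes per species as animals.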

In the future, at least three additional types of
ubiquitin-associated proteins will be collected and
systematically identified. First, the polyubiquitin ligase (E4) was
demonstrated to be responsible for assembling
multiubiquitin chains (39). It is still not known how
many E4 ligases are encoded in eukaryotic genomes.
As the number of known E4s is quite limited, such
information was not included in UUCD. Second, accumulating
experiments revealed that a large number of
ubiquitin-binding domain (UBD)-containing proteins (>150) can
non-covalently interact with ubiquitin (40). Thus, UBD
proteins can dynamically interact with modified substrates
in a ubiquitination-dependent manner and participate in
the ubiquitin network (40). Third, the systematic
identification of ubiquitinated proteins with modified sites
has recently become a hot topic (2). Such information will be
carefully collected and curated in the near future. Taken
together, although more information remains to be
integrated, the comprehensive UUCD with E1s, E2s, E3s
and DUBs across 70 eukaryotes can serve as a useful
resource for further research. The UUCD database will
be continuously updated as the proteome sequences of
more species become available.

Figure 3. The search and advance options. (A) The database can be queried with one or multiple keywords. (B) Advance search allows users to input
up to three terms for a precise search. (C) The BLAST search option was designed for searching the database with one protein sequence in FASTA format.
(D) The HMM search option scans the input protein sequence with pre-constructed HMM profiles.